Abstract

A parallel computing system becomes increasingly prone to failure as the number of processing elements in it increases. In this paper, we describe a completely general strategy that takes an arbitrary step of an ideal CRCW PRAM and automatically translates it to run efficiently and robustly on a PRAM in which processors are prone to failure. The strategy relies on efficient robust algorithms for solving a core problem, the Write-All problem. This problem is critical because, as we show, its complexity is equal to that of any general strategy for realizing robustness in our model. We analyze the expected parallel time and work of various algorithms for solving this problem and prove a lower bound. Our general strategy is also applicable in the context of asynchronous PRAMs, given asynchronous algorithms for the Write-All problem.


'The  research  of  this  author  was  supported  in  part  by  the  Office  of  Naval  Research  under  contract  number 
N00014-85-K-0046  and  by  the  National  Science  Foundation  under  grant  number  CCR-89-6949. 


1      Introduction 

With  hardware  becoming  cheaper,  it  is  expected  that  "massively  parallel  systems"  with  large 
numbers  of  processors  will  both  increase  the  speed  of  computations  and  decrease  their  cost. 
Unfortunately, as the number of processing elements grows, certain difficulties need to be addressed, among them asynchrony and processor faults. These two problems are related, but in
this  paper  we  concentrate  on  processor  faults  only.  Clearly,  the  larger  the  number  of  processors, 
the  greater  the  probability  of  some  processors  failing. 

Much of the recent work on PRAM algorithm design for a variety of basic problems, such as list ranking [TV84,CV86], sorting [C86], forest matching [KP88], pattern matching [KLP89] and others, has emphasized efficiency and optimal speedup. These extremely efficient (ideally optimal speedup) parallel algorithms have very little "slack," in that every step of the algorithm is essential. Therefore, quite often, they do not terminate correctly when perturbed by a few simple processor failures.

Because of this, it is of paramount importance to study the design of robust and efficient parallel algorithms. Our goal in this paper is to develop a general methodology for implementing ideal CRCW PRAMs robustly on faulty CRCW PRAMs. We adopt the approach of "graceful degradation": if processors fail-stop during the computation, then as long as at least one processor remains operational, the PRAM program can continue executing correctly, independent of its semantics. In addition, we want to minimize the performance penalty incurred in robust implementations. We present a technique that takes an arbitrary step of an ideal CRCW PRAM designed to run on U "virtual" processors and executes it on a CRCW PRAM of the same type with P < U processors that are prone to failure¹. (The number of virtual processors U, that is, the parallelism width, can vary from step to step.) By modeling these failures probabilistically, we analyze the expected work and parallel time complexities of a variety of deterministic and probabilistic schemes for realizing these robust implementations.

Kanellakis and Shvartsman [KS89] were the first to formalize this notion of robustness within the context of synchronous parallel computation. They developed a failure model (to be described later), which we use here. In [KS89], they considered the Write-All problem, defined below, and described its deterministic robust implementation.

Write-All problem: Given an array x[1..U] initialized to 0, set x[i] := 1 for all i. (We will also refer to such writing of 1 as "marking.")

Their algorithm (henceforth referred to as the KS-algorithm for short) iteratively estimates the number of remaining unwritten locations and remaining processors, and reschedules the processors to other locations for writing 1's. It is easy to see that their algorithm for the Write-All problem can be used for computing associative functions, such as max.

Using the Write-All algorithm, Kanellakis and Shvartsman designed robust algorithms for fundamental problems such as list ranking. In [KS89] they also analyze the deterministic worst-case complexity of the KS-algorithm, and they observe various specific complexity improvements in the case where P < U. Specifically, they parameterized the KS-algorithm to achieve deterministic optimal work of O(U) using O(U/log^2 U) initial processors. An improvement for the case where the number of initial processors is U/log U was observed independently by Khuller [Kh89].


¹That is, we execute an ideal Common/Arbitrary/Priority CRCW PRAM program respectively on a Common/Arbitrary/Priority faulty CRCW PRAM.

Martel et al. [MPS89] describe a probabilistic algorithm for computing the maximum of U elements assuming asynchronous computations. Their algorithm (which we refer to as a Collective Coupon Collector Algorithm, or CCC-algorithm for short) can be immediately used for robust computation of the maximum as well as for solving the Write-All problem. A variant of an asynchronous parallel model (APRAM) was described by Cole and Zajicek [CZ89], who presented algorithms for summation, graph connectivity, etc. on this model. Many researchers have studied closely related issues, including robustness and fault tolerance in an asynchronous setting [A88,AAG87,AS88,DPPU86,P85,SS83].

Previous efforts at "adapting" ideal PRAM programs to cope with imperfections, such as asynchrony or processor faults, have focused on redesigning specific algorithms. Our paper departs significantly from this approach by providing a general strategy for simulating arbitrary PRAM steps on PRAMs with faults. Specifically, we show that the complexities of solving the Write-All problem robustly and of implementing a step of an ideal CRCW PRAM robustly are identical up to a small constant multiplicative time overhead and a small constant per-processor additive space overhead. We do this constructively, by showing how to use an arbitrary robust Write-All algorithm twice² to implement an arbitrary step of the ideal PRAM. In effect, we provide a two-phase idempotent execution strategy (TIES for short) that uses any robust algorithm for solving the Write-All problem³ to automatically yield robust implementations of arbitrary ideal PRAM steps. Independently, Shvartsman [Shv89] obtained a technique, similar to our strategy described later, which allowed him to implement robustly any parallel algorithm whose work is within a polylog factor of that of the best sequential algorithm and whose local memory requirements are within a polylog factor of the problem size.

We identify the Write-All problem as the core step in robust implementations of PRAM algorithms. Therefore, solving it efficiently is critical to realizing such robust implementations with low overhead. Towards this end, we also introduce a new and extremely simple algorithm based on pointer doubling (the PD-algorithm) for solving the Write-All problem. We analyze the expected work as well as the expected parallel time taken by this simple PD-algorithm, as well as by the KS-algorithm. We show that the expected behavior of the PD-algorithm is better than that of the KS-algorithm; detailed statements of these results can be found later. We also show that the expected work done by a randomized parallel version of a Coupon Collector scheme is less than that of the deterministic KS and PD algorithms. While the algorithms we study are not always complicated, we found the complexity analysis of their expected behavior to be technically intricate. Finally, we improve the lower bound (from [KS89]) on the amount of work that any deterministic algorithm for the Write-All problem must do in the worst case; this lower bound matches the expected work done by our deterministic PD-algorithm.


²Two points are in order here. First, we assume that termination of the Write-All algorithm is detectable. Second, we actually use a modification of the underlying Write-All algorithm as a "skeleton" for our technique. These notions will be made precise later.

³Actually, any algorithm that can verify that all of the given U instructions have been touched or executed will suffice. The Write-All problem is a special case of this more general Touch-All paradigm, and is used here. Computation of maximum or summation could also have been used.


2  Model  of  failures  and  some  conventions 

Although our results hold for arbitrary CRCW PRAMs, for the purpose of this paper we assume the Common variant of the CRCW PRAM of [FW78]. When dealing with probabilistic algorithms, we assume that the individual processors of the PRAM are supplied with independent random number generators.

Consider a parallel algorithm A that starts with $P_0 = P$ available processors and in parallel time $\tau$ completes its computation on input data $I_A$. The availability pattern $\Pi_A$ indicates which processors are available or operational for the algorithm A at each parallel time step. Let $P_i$ be the number of processors available at time $i$ ($0 \le i \le \tau$). Then,

Definition 1  The work performed in the execution of algorithm A on input $I_A$, with availability pattern $\Pi$, is $W(A, I_A, \Pi) = \sum_{i=0}^{\tau} P_i$.⁴

As  in  the  case  of  Kanellakis  and  Shvartsman  [KS89],  we  will  only  be  interested  in  the  fail- 
stop  availability  patterns.  In  such  patterns,  processors  either  stay  available  or  fail  and  once 
they  fail,  they  stay  failed  for  the  remaining  steps  of  the  computation. 
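
As a small illustration of Definition 1 and of the fail-stop restriction, the sketch below represents an availability pattern as a list of sets, one set of operational processor identifiers per time step. This representation and the function names `is_fail_stop` and `work` are assumptions made for the example; they do not come from [KS89].

```python
from typing import List, Set

def is_fail_stop(pattern: List[Set[int]]) -> bool:
    """True if no processor reappears after failing (available sets only shrink)."""
    return all(later <= earlier for earlier, later in zip(pattern, pattern[1:]))

def work(pattern: List[Set[int]]) -> int:
    """Definition 1: sum over all time steps of the number of available processors."""
    return sum(len(available) for available in pattern)

# Four processors; processor 3 fails after step 0, processor 2 after step 1, ...
pattern = [{1, 2, 3, 4}, {1, 2, 4}, {1, 4}, {4}]
print(is_fail_stop(pattern), work(pattern))  # True 10
```

With no failures and P processors over T steps, `work` reduces to the usual processor-time product P·T.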

An availability pattern $\Pi$ is called oblivious if it is determined before the start of the execution of the algorithm. It is called Byzantine if it is created (by an adversary) adaptively, during the computation. In such a case, the adversary determines the set of available processors at the next time step by examining the history of the computation (which includes random choices previously made) up to that time.

The average case analysis in this paper deals with random failures, where each processor may fail with a fixed probability q < 1 in a consecutive sequence of time steps, referred to here as an epoch or period; the size of each epoch may be a function of U. If all the processors in the PRAM fail, we assume that this condition can be detected externally.
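
The random failure model can be sketched as a short simulation. The code below assumes, for illustration only, that every processor operational at the start of an epoch fails during that epoch independently with probability q, and it records how many processors survive from epoch to epoch. The function name `surviving_counts` and its parameters are hypothetical.

```python
import random
from typing import List

def surviving_counts(P: int, q: float, epochs: int, seed: int = 0) -> List[int]:
    """Number of operational processors at the start of each epoch.

    Sketch assumption: a processor that is operational at the start of an
    epoch fails during it independently with probability q (fail-stop).
    """
    rng = random.Random(seed)
    counts = [P]
    for _ in range(epochs):
        survivors = sum(1 for _ in range(counts[-1]) if rng.random() >= q)
        counts.append(survivors)
    return counts

counts = surviving_counts(P=1024, q=0.25, epochs=8)
ratios = [b / a for a, b in zip(counts, counts[1:]) if a > 0]
print(counts)
print([round(r, 2) for r in ratios])
```

While the counts are large, the printed ratios should stay close to 1 - q, which is the geometric decay that the average case analysis relies on.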

3  Summary  of  Results 

We now summarize our main contributions. All our expected case results (deterministic and randomized) summarized below hold with an extremely high probability of at least $1 - U^{-\gamma}$ for some $\gamma > 1$ controllable by the algorithm implementor.


3.1      The  two-phase  execution  strategy 

1.  We  introduce  a  two-phase  idempotent  execution  strategy  that,  in  conjunction  with  any 
robust algorithm for the Write-All problem, can be used to simulate an arbitrary step of
a  PRAM  robustly.  This  TIES  strategy  can  be  implemented  using  a  constant  amount  of 
additional  space  per  functional  processor. 


⁴This notion of work generalizes the traditional processor-time product definition for the case when there are no processor failures (see [TV84] for example). It is similar to the definition of work in [MPS89], although they define it in the slightly different but essentially equivalent model of asynchronous PRAMs. It is also the same as the available processor steps of [KS89].


3.2  Average case complexity results for the Write-All problem

We analyze the expected work as well as the expected parallel time of the KS-algorithm and the PD-algorithm for the Write-All problem. We assume a random pattern of failures as stated in section 2, where the failures can be Byzantine within each epoch.

2.  Average case analysis of the KS-algorithm for the Write-All problem, with an epoch size of O(log U). We show that:

(a)  Its expected work is O((P + U) log U), where U is the array size and P is the initial number of processors.

(b)  Its expected parallel time when U = P, and the processor failure rate is small, is O(log P log U).

(c)  Its expected parallel time when $P < c_0 U$ for some constant $c_0 > 0$ is Ω((U + log P) log U).

Note that the KS-algorithm has a very interesting threshold in its expected parallel time, which degrades very rapidly from O(log P log U) (case (b) above) to essentially sequential behavior (case (c)) even when the number of initial processors P is o(U).

3.  We present a new and extremely simple PD-algorithm for the Write-All problem which is based on pointer doubling⁵. With an epoch size of log U we show that:

(a)  The expected parallel time of the PD-algorithm is O((U/P) log U) when U/log U < P < U.

(b)  Hence its expected work is O(U log U).

Surprisingly, this simple PD-algorithm does significantly better in its expected parallel time behavior than the KS-algorithm. In particular, its expected parallel time stays stable at O((U/P) log U) even when the initial number of processors decreases to P > U/log U.

3.3  Randomized algorithms for the Write-All problem

In this analysis, we assume worst-case oblivious availability patterns of processors.

4.  We present a simple probabilistic algorithm for the Write-All problem, referred to here as the Concurrent Coupon Collector Algorithm (or CCC-algorithm for short, sketched in section 7.1), based on an approach of Martel et al. [MPS89]. We show a threshold behavior in the expected parallel time of this probabilistic algorithm as well. (Unfortunately, as noted in [MPS89], its expected work is O(U log U) and therefore it does not improve on the expected work of the KS or the PD-algorithms.)


⁵Initially, every location i < U in x has a pointer to location i + 1. As soon as a 1 is written in location i, its pointer is used by the processor to find the next location to write. On subsequent steps, processors use pointers to double recursively. (A small initial segment of the array is handled separately later.)


5.  We show that by a minor modification to the CCC-algorithm, the resulting Adapted Coupon Collector Algorithm (or ACC-algorithm, sketched in section 7.2) does only O(U log log U) work provided P < U/log U. Therefore, its expected work improves on that of the two deterministic algorithms described above even with as many as U/log U initial processors. Martel et al. [MPS89] analyze the expected work complexity of this algorithm as well, but their result appears to be flawed.

3.4      Lower bounds on worst-case work

6.  We show that the work done by any algorithm for solving the Write-All problem is Ω(U + P log U).

4     Two-phase  idempotent  execution  strategy 

4.1  Model  of  the  ideal  PRAM 

Without loss of generality, our ideal CRCW PRAM is a simple modification of the PRAM as described in [AHU74]. To save space, we omit many details. The PRAM has U "virtual" processors $p_1, \ldots, p_U$ with no local registers; all memory is global and shared. The "application memory" accessible by the processors is a shared vector M[1..] of some length. W.l.o.g., the processors execute a single program whose instructions are listed in a vector INST[1..I]. The instructions are as those in [AHU74], with appropriate modifications. Thus, a typical instruction might be: M[j_1] := M[j_1] + M[j_2]. There is also a vector PC[1..U], initialized to 1, containing the program counters of the processors. In a single step of the PRAM, each processor $p_i$ executes the following

Internal Program: Read PC[i]; if PC[i] = 0, then halt; else, read, "decode," and execute INST[PC[i]], and write the new value of PC[i].

4.2  The  strategy 

We assume the existence of some robust algorithm for the Write-All problem which, given a vector x[1..U] initialized to 0 and an availability pattern $\Pi$, sets x[i] := 1 for i = 1, ..., U in $\tau$ steps. It uses some auxiliary data structure AUX of size O(U), appropriately initialized, all of whose locations are written during the execution of the algorithm. Furthermore, at the end of its execution, a bit variable DONE is set to TRUE. We implement a robust CRCW PRAM in a faulty CRCW PRAM of the same type. Here we assume the Common type PRAM. One step of the ideal PRAM will take $\tau$ steps⁶ with constant (small) space overhead per processor.

Our (faulty) CRCW PRAM will contain M and INST; two versions of PC (of the original application program), AUX and DONE: PC.old, PC.new, AUX.old, AUX.new, DONE.old, and DONE.new; and two new vectors LOC[1..U], VAL[1..U]. In addition, we will need space to store the instructions of the Write-All algorithm, x, its PCs (which have to be handled carefully), etc.; we do not elaborate on this (other subtleties are ignored too). Also, w.l.o.g. we assume that M is of the same length as before.


⁶This is the same number of steps as that taken by the Write-All algorithm; however, each step will be slightly more time consuming.

We  now  sketch,  very  briefly,  the  simulation  of  one  step  of  the  ideal  PRAM  while  omitting 
many  important  technical  details.  By  informal  induction  we  assume: 

•  The values of M, INST, PC.old in the faulty PRAM are the same as those of M, INST, PC in the ideal PRAM.

•  The value of AUX.old at the beginning of the simulation is the same as that of AUX at the beginning of the execution of the robust algorithm for Write-All, and DONE.old is FALSE.

The  robust  PRAM  proceeds  in  two  phases,  each  based  on  a  single  execution  of  the  robust 
Write-All  algorithm.  By  the  algorithm,  the  (live)  processors  are  synchronized  between  the 
phases.  The  behavior  of  the  PRAM  is  reminiscent  of  the  deferred  write  approach  in  the  redo/no- 
undo  recovery  protocol  in  database  operating  systems.  Phase  1,  in  effect,  creates  the  deferred 
writes,  phase  2  installs  them.  It  is  important  that  the  phases  are  idempotent,  as  processors  may 
fail  after  doing  some  work,  and  the  various  structures  may  become  inconsistent.  We  describe 
the  two  phases  (not  in  complete  detail)  in  turn. 

4.2.1  Phase  1 

Run the underlying robust Write-All algorithm. Just before the point where a processor is supposed to execute x[i] := 1, simulate the execution of INST[PC.old[i]]. However, if INST[PC.old[i]] assigns v to M[l], instead execute LOC[i] := l, VAL[i] := v; if the instruction does not write, set LOC[i] := 0. The new value of PC[i] will be stored in PC.new[i]. In order to avoid circular reduction, we have to be concerned with "cleaning up" AUX, which is easily done: before a location of AUX.old is modified, the corresponding location of AUX.new is initialized. At the end, DONE.new := FALSE and DONE.old := TRUE are executed.

4.2.2  Phase  2 

This phase starts when DONE.old becomes TRUE. Again, using the robust Write-All algorithm, values computed in phase 1 are copied into the correct locations: if LOC[i] ≠ 0 then M[LOC[i]] := VAL[i]. Furthermore, PC.old[i] := PC.new[i]. AUX.new is being used, and AUX.old is being (re-)initialized. At the end, DONE.old := FALSE and DONE.new := TRUE are executed.
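
The two phases can be illustrated with a minimal sequential sketch. It makes several simplifying assumptions that are not part of the strategy as described: the robust Write-All algorithm is abstracted as a list of visited indices (possibly with repeats, standing in for work redone after failures), instructions are modeled as hypothetical Python callables returning (location, value, next PC), and the handling of AUX and DONE is omitted.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical instruction encoding for this sketch: an instruction reads the
# shared memory M and returns (location to write or 0, value, next PC).
Instr = Callable[[int, List[int]], Tuple[int, Optional[int], int]]

def ties_step(M: List[int], INST: List[Optional[Instr]], PC_old: List[int],
              touches1: List[int], touches2: List[int]) -> None:
    """Simulate one ideal PRAM step; all vectors are 1-indexed (slot 0 unused).

    touches1/touches2 list the indices i visited by the Write-All algorithm
    in each phase. Repeats are allowed, but every index in 1..U must appear.
    """
    U = len(PC_old) - 1
    everyone = set(range(1, U + 1))
    assert everyone <= set(touches1) and everyone <= set(touches2)
    LOC = [0] * (U + 1)
    VAL: List[Optional[int]] = [None] * (U + 1)
    PC_new = [0] * (U + 1)

    # Phase 1: create deferred writes. M and PC_old are only read here, so
    # repeating the work for an index i recomputes exactly the same values.
    for i in touches1:
        if PC_old[i] == 0:                      # virtual processor i has halted
            LOC[i], PC_new[i] = 0, 0
            continue
        LOC[i], VAL[i], PC_new[i] = INST[PC_old[i]](i, M)

    # Phase 2: install the deferred writes. LOC, VAL and PC_new are only read.
    for i in touches2:
        if LOC[i] != 0:
            M[LOC[i]] = VAL[i]
        PC_old[i] = PC_new[i]

U = 3
# M[i] := M[i] + M[i+1] (cyclically), then halt.
add_next: Instr = lambda i, M: (i, M[i] + M[i % U + 1], 0)
M = [0, 1, 2, 3]
PC = [0, 1, 1, 1]
ties_step(M, [None, add_next], PC, touches1=[1, 2, 2, 3, 1], touches2=[3, 3, 1, 2])
print(M[1:], PC[1:])  # [3, 5, 4] [0, 0, 0]
```

Executing the instruction directly on M would not tolerate repetition (a repeated index would add twice), which is why phase 1 only records the intended write and phase 2 only copies it.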

5      Average-case  analysis  of  the  KS-algorithm 

Let $j_0, j_1, \ldots$ be the time instants at which processors actually write into the array x; these steps will be referred to as writing epochs or epochs. We use $U_j$ to denote the number of unwritten locations at the beginning of step j. Our availability patterns $\Pi$ are random in the sense that each processor that is operational at the end of writing epoch $j_x$ may fail with fixed probability q < 1 between the epochs $j_x$ and $j_{x+1}$. $\Pi$ is Byzantine otherwise. Let $P_0 = P$ be the number of processors initially available, and assume $U_0 = U$. We will first prove the following useful technical theorem.

Theorem 1  There are constants $\epsilon_1, \epsilon_2 \in (0,1)$ and $\alpha > 1$ that depend on q such that, as long as $P_{j_x} > \alpha \log P$, the inequality $\epsilon_1 < P_{j_{x+1}}/P_{j_x} < \epsilon_2$ holds for all such epochs $j_x$ with probability at least $1 - P^{-\gamma}$ for some $\gamma > 2$.

Proof  Given an epoch $j_x$ such that the number of operational processors at the beginning of $j_x$ is $P_{j_x} > \alpha \log P$, and any $\beta \in (0,1)$, let $E_x$ be the event "the number of failed processors is in the interval $(1 \pm \beta) q P_{j_x}$." From Chernoff bounds [C52] we have $\mathrm{Prob}\{E_x\} \ge 1 - \exp(-c\, q P_{j_x})$ for a constant $c > 0$ that depends only on $\beta$. Choosing $\alpha$ sufficiently large in terms of $c$, $q$ and a constant $\gamma > 2$, we get $\mathrm{Prob}\{E_x\} \ge 1 - P^{-(\gamma+1)}$.

Now, let E be the event that $E_x$ holds for all epochs $j_x$ such that $P_{j_x} > \alpha \log P$. Then, if $\bar{E}$ is the event "there is a $j_x$ such that $E_x$ is violated," we get $\mathrm{Prob}\{\bar{E}\} \le \sum \mathrm{Prob}\{\bar{E}_x\}$ over all $j_x$ for which $P_{j_x} > \alpha \log P$, i.e. $\mathrm{Prob}\{\bar{E}\} \le \sum P^{-(\gamma+1)} \le P^{-\gamma}$. Hence, $\mathrm{Prob}\{E\} \ge 1 - P^{-\gamma}$. This completes the proof of the Theorem with constants $\epsilon_1 = 1 - (1+\beta)q$, $\epsilon_2 = 1 - (1-\beta)q$. □

For convenience of analysis, we will view the execution of algorithm KS as being split into two phases. The first phase contains all epochs $j_x$ such that $P_{j_x} > \alpha \log P$, and the second phase is the rest of the algorithm's execution.

5.1  Analysis  of  the  first  phase 

For  the  first  phase  we  have: 

Lemma 1  Conditioned on event E, the number of epochs of the first phase is O(log P).

Proof  We have $P_{j_{x+1}} < \epsilon_2 P_{j_x}$ for all the epochs of the first phase. If these epochs are $j_0, \ldots, j_{e_1-1}$, then it must be that $\alpha \log P < P_{j_{e_1-1}} \le \epsilon_2^{e_1-1} P$, i.e. $e_1 = O(\log P)$. □

Corollary 1  Conditioned on event E, the amount of work for the first phase is $W_1$ = O(P log U).

Corollary 2  Conditioned on event E, the number of parallel steps of algorithm KS for the first phase is $T_1$ = O(log P log U).

5.2  Analysis  of  the  second  phase 

To  analyze  the  second  phase  of  algorithm  KS  we  need  the  following  result  of  [KS89]: 

Fact 1  Let $P_i$ and $U_i$ respectively denote the number of operating processors and the number of unwritten positions at the beginning of epoch i. Then, the work from this epoch is $O((P_i + U_i + P_i \log U_i) \log U)$.

By using this fact and Theorem 1, we get:

Lemma 2  Conditioned on event E, the work in the second phase of algorithm KS is $W_2 = O((\alpha \log P + U + \alpha \log P \log U) \log U)$.


Having estimated the work during the second phase, we now analyze its expected parallel time. To do this, we need to calculate the number of unwritten positions $U_{j_{e_1}}$ at the end of the first phase. Again, conditioning on the event E we get (by an easy calculation) that $U_{j_{e_1}} \ge U - \frac{P}{1-\epsilon_2}$. We distinguish two cases depending on the value of P:


5.2.1  Case 1: U = P and q < 1/3.

Lemma 3  Conditioned on event E, for any epoch $j_x$ of the first phase it is true that $U_{j_x} \le P_{j_x}$.

Proof  Conditioned on event E, we have $P_{j_1}/P_{j_0} \ge \epsilon_1$ (where $P_{j_0} = P_0$). Let $\nu_1 = P_{j_1}/P_{j_0}$. By an inequality of [KS89] we have $U_{j_1} \le U_{j_0}(1 - \nu_1/2)$. Thus, $U_{j_1} \le U(1 - \epsilon_1/2)$. Also, $P_{j_1} \ge \epsilon_1 P \ge \epsilon_1 U$. Thus, $U_{j_1} \le P_{j_1}$ when $1 - \epsilon_1/2 \le \epsilon_1$, i.e. when $\epsilon_1 \ge 2/3$, i.e. when q < 1/3. By repeating the above argument (for q < 1/3) the Lemma is proved. □

By Lemma 3 and given the event E, we get $U_{j_{e_1}}$ = O(log P), for which we have:

Lemma 4  If U ≤ P and q < 1/3, then with probability at least $1 - P^{-\gamma}$ for some $\gamma > 2$, the number of parallel steps of the second phase of algorithm KS is $T_2$ = O(log P log U), and hence its parallel time is O(log P log U) with the same probability.

5.2.2  Case 2: $\frac{P}{1-\epsilon_2} < \lambda U$, $0 < \lambda < 1$.

Then, given that E holds, we get $U_{j_{e_1}} > (1 - \lambda)U$ and $P_{j_{e_1}} \le \alpha \log P$. In this case, the number of epochs of the second phase is O(U), since the processors are very few and the number of unwritten positions is very large. Thus

Lemma 5  If there is a constant $\lambda \in (0,1)$ such that $P < \lambda(1 - \epsilon_2)U$ ($\epsilon_2$ as in Theorem 1) then, with probability at least $1 - P^{-\gamma}$ for some $\gamma > 2$, the number of parallel steps of the second phase of the KS-algorithm is $T_2$ = Ω(U log U), and hence its parallel time is Ω(U log U) with the same probability.
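
The bound on the number of unwritten positions at the end of the first phase, on which Case 2 rests, can be worked out as follows. The derivation is a sketch under two assumptions that are ours rather than the text's: the first-phase epochs are indexed $j_0, \ldots, j_{e_1-1}$, and each operational processor writes at most one location of x per writing epoch.

```latex
\begin{align*}
U_{j_{e_1}} &\ge U - \sum_{x=0}^{e_1-1} P_{j_x}
  && \text{(at most one write per processor per epoch)} \\
 &\ge U - P \sum_{x=0}^{e_1-1} \epsilon_2^{\,x}
  && \text{(event $E$ gives $P_{j_x} \le \epsilon_2^{\,x} P$)} \\
 &\ge U - \frac{P}{1-\epsilon_2}
  && \text{(geometric series)}.
\end{align*}
```

In particular, if $P < \lambda(1-\epsilon_2)U$ then more than $(1-\lambda)U$ positions remain unwritten when the first phase ends, which is the hypothesis of Lemma 5.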

5.3      The  expected  work  and  parallel  time  of  the  KS-algorithm 

From Corollary 1 and Lemma 2, the claimed result (2(a) from section 3) about the expected work of the KS-algorithm follows. As for the claimed results 2(b) and 2(c) (also from section 3) about the expected parallel time of the KS-algorithm, Corollary 2 together with Lemmas 3 and 4 yields result 2(b), whereas Lemma 5 yields 2(c).

6      The  PD-algorithm  and  its  average  case  analysis 

6.1      The  algorithm 

We propose the following simple algorithm that exhibits stable expected parallel time behavior and has low expected work complexity. The PD-algorithm involves a trivial modification of the (well-known) naive pointer-doubling technique. Basically, in addition to the shared array x[1..U] of the Write-All problem, we assume that each array location also contains a pointer. The pointers can be implemented as an additional array s[1..U]. Initially, s[i] = i + 1 for i < U and s[U] = nil. Furthermore, DONE is initialized to FALSE.

Algorithm  Pointer-Doubling

•  Processor assignment: For each k (1 ≤ k ≤ P), processor k is assigned to write 1 in array position x[i], where i = (k - 1)(U/P) + 1.

•  Phase 1: Each processor k executes the following loop:
while s[i] ≠ nil do
    if x[s[i]] = 0 then x[s[i]] := 1;
    s[i] := s[s[i]];
od

•  Phase 2: if s[i] = nil then
go to the head of the list and execute phase 1 until nil is encountered, execute DONE := TRUE, and halt.

Lemma 6 (Correctness)  If at least one processor survives, all of x[1..U] will be set to 1.
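
The algorithm and Lemma 6 can be exercised with a small synchronous simulation. The sketch below is an illustration under stated assumptions, not the authors' implementation: P divides U, each processor other than one designated survivor fails independently with probability `fail_prob` at every step (a per-step rate chosen for simplicity, rather than the per-period rate of the analysis), and all reads of a step precede its writes. The function name `pd_write_all` and its parameters are hypothetical.

```python
import random
from typing import List, Optional, Tuple

def pd_write_all(U: int, P: int, fail_prob: float = 0.0,
                 seed: int = 0) -> Tuple[List[int], bool, int]:
    """Synchronous simulation of the PD-algorithm under fail-stop failures.

    Returns (x, DONE, steps); arrays are 1-indexed and slot 0 is unused.
    """
    assert 1 <= P <= U and U % P == 0      # sketch assumption: P divides U
    rng = random.Random(seed)
    x = [0] * (U + 1)
    s: List[Optional[int]] = [i + 1 for i in range(U + 1)]
    s[U] = None                             # None plays the role of nil
    pos = {k: (k - 1) * (U // P) + 1 for k in range(1, P + 1)}
    phase = {k: 1 for k in pos}
    alive = set(pos)
    survivor = rng.choice(sorted(alive))    # never fails (hypothesis of Lemma 6)
    done, steps = False, 0
    while not done:
        # Fail-stop: a failed processor never returns.
        alive = {k for k in alive if k == survivor or rng.random() >= fail_prob}
        # All reads of a step happen before any write of that step.
        reads = {}
        for k in alive:
            nxt = s[pos[k]]
            reads[k] = (nxt, s[nxt] if nxt is not None else None)
        for k in alive:
            i = pos[k]
            nxt, nxt2 = reads[k]
            x[i] = 1                        # the processor's current location
            if nxt is not None:
                x[nxt] = 1                  # mark the successor ...
                s[i] = nxt2                 # ... and double the pointer
            elif phase[k] == 1:
                phase[k], pos[k] = 2, 1     # phase 2: restart at the head
            else:
                done = True                 # DONE := TRUE
        steps += 1
    return x, done, steps

x, done, steps = pd_write_all(U=16, P=4, fail_prob=0.3, seed=1)
print(all(x[1:]), done, steps)
```

The simulation maintains the property that every location strictly between i and s[i] is already marked, so a processor that reaches nil from the head of the list certifies that the whole array has been written.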

6.2      Average  case  analysis  of  the  PD-algorithm 

In the sequel we will assume that the initial number of processors is P > U/log U. We will assume an availability pattern $\Pi$ which is random. As before, we split the parallel time into consecutive intervals or periods, where each period has log U steps. Each processor which is operational at the beginning of a period has a fixed probability q < 1 of failing at some step during that interval, independently of other processors. Notice that the availability pattern assumed here is equivalent to that assumed in the average case analysis of the KS-algorithm, in the sense that the processor "decay rate" is constant, measured over periods of log U steps. Patterns with more frequent failure (faster decay rates) than $\Pi$ are unacceptable because they tend to exhaust the available processors before any useful work can be done. We analyze the complexity of the PD-algorithm by considering the following two cases.

6.2.1      Case  1:  P  =  U. 

In  this  case,  each  array  location  is  assigned  to  a  processor  initially.  We  first  show  that 

Lemma 7  There exists a fixed $\alpha > 1$ such that after $\alpha$ periods, except for an initial segment of length O(log U), the array x will be marked with probability at least $1 - U^{-b}$, where $b > 1$ depends on $\alpha$.

Let us now condition the rest of the analysis on the event described in Lemma 7. It follows that the PD-algorithm will terminate in O(log U) steps with probability $1 - U^{-b}$ for some $b > 1$. This is because in an additional O(log U) steps, the number of remaining processors will still be a constant fraction of U, and at least one of them will mark the (possibly) unmarked initial segment of the array x. Thus,

Theorem 2  The expected parallel time of the PD-algorithm, with U processors initially, is O(log U). The probability that the actual parallel time exceeds Θ(log U) is less than $U^{-b}$, for some $b > 1$.


6.2.2       Case 2: $\lambda \frac{U}{\log U} < P < U$ for some constant $\lambda > 1$.

Recall that initially, each processor k (1 ≤ k ≤ P) is assigned to array location (k - 1)(U/P) + 1. As in case 1 above (Lemma 7), we have,

Lemma 8  Given $b > 1$ we can select an $\alpha > 1$ so that, after $\alpha$ periods and with probability at least $1 - U^{-b}$, the number of operational processors will be Θ(P).

Let  q'  =  1  —  (1  —  q)°'  denote  the  probability  that  a  given  processor  fails  in  a  periods.  As 
always,  /?  is  a  constant  from  the  interval  (0,1).  In  case  2,  once  1  -  (1  +  f3)(^  >  A,  at  least 
J3^  processors  remain  operational  for  alogU  periods  for  a  >  1  with  high  probability  (by 
Lemma  8).  On  the  average,  these  processors  that  remain  operational  after  a  periods  would 
have  started  at  roughly  equidistant  positions  (proved  using  Chernoff  bounds).  From  this,  we 
conclude  that  these  processors  will  complete  the  pointer  doubling  of  the  whole  array  (except 
possibly for an initial segment) in time O(log U) with very high probability during Phase 1
of the PD-algorithm. The (possibly) unmarked initial segment will be processed during the
second phase of the PD-algorithm. Its length provides an upper bound on the time of the second
phase. 
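
The failure probability q' used above can be checked numerically. The following is a minimal Python sketch, assuming (as the formula suggests) that each processor fails independently with probability q in each period and that the exponent in q' is the number of periods α. The function names failure_prob_over and simulate_survivors are hypothetical and chosen only for illustration.

```python
import random


def failure_prob_over(q: float, alpha: int) -> float:
    """Return q' = 1 - (1 - q)^alpha, the chance of failing within alpha periods."""
    return 1.0 - (1.0 - q) ** alpha


def simulate_survivors(P: int, q: float, alpha: int, rng: random.Random) -> int:
    """Count processors (out of P) that survive alpha periods.

    Assumption: failures are independent across processors and periods,
    each occurring with probability q per period.
    """
    survivors = 0
    for _ in range(P):
        # A processor survives only if it does not fail in any of the alpha periods.
        if all(rng.random() >= q for _ in range(alpha)):
            survivors += 1
    return survivors


if __name__ == "__main__":
    P, q, alpha = 10_000, 0.1, 3
    q_prime = failure_prob_over(q, alpha)
    observed = simulate_survivors(P, q, alpha, random.Random(0))
    print("expected survivors:", (1.0 - q_prime) * P)
    print("observed survivors:", observed)
```

Under these assumptions the observed number of survivors should concentrate around (1 - q')P, which is the kind of concentration the Chernoff-bound argument relies on.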

Lemma 9 The length of the unmarked initial segment of the array at the beginning of Phase
2 of the PD-algorithm is O((U/P) log U) with probability at least 1 - U^{-b} for some b > 1.

Theorem 3 For U/log U ≤ P ≤ U, the parallel time of the PD-algorithm is O((U/P) log U) with
probability at least 1 - U^{-b} for some b > 1, and the expected parallel time is O((U/P) log U).

7     Randomization can decrease the work to O(U log log U)

We will briefly sketch the CCC-algorithm in section 7.1. Our main purpose in doing so is to
use it in outlining the ACC-algorithm (in section 7.2), which has surprisingly low overhead. The
ACC-algorithm solves the Write-All problem in expected work O(U log log U) with as many as
P ≤ U/log U processors.

7.1      The  CCC-algorithm 

The CCC-algorithm is a trivial variant of the maximum-finding algorithm for asynchronous
PRAMs presented by Martel et al. in [MPS89]. Informally, view the locations in the array x of
the Write-All problem as the U leaves of a binary tree of depth log U. (W.l.o.g. we assume
that U is a power of two.) The CCC-algorithm proceeds as follows. Initially all tree nodes are
unmarked. We start with P = U processors. Each functional processor selects a tree node v at
random. If the node v is a leaf or if both children of v are marked, then node v is also
marked. This step is repeated by all the functional processors until the root is marked. Note
that marking the root is the same as certifying that all the locations of x have been written. A
simple variant of the analysis in [MPS89] shows that this algorithm does O(U log U) expected
work. We can show that it has an interesting threshold in its parallel time behavior. Because
of space constraints, we are unable to go into the details of this issue here. Instead, we move on to
analyze the ACC-algorithm, which has better expected work behavior than any of the algorithms
analyzed thus far.
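
The marking rule of the CCC-algorithm can be illustrated with a small sequential simulation. The Python sketch below is not the authors' implementation: it stores the tree in a 1-indexed heap layout, evaluates each synchronous step against the marks present at the start of the step, and assumes that each functional processor fails independently with probability q at the end of every step. The function name ccc_write_all and the parameter q are hypothetical.

```python
import random


def ccc_write_all(U: int, q: float = 0.0, rng: random.Random = None):
    """Simulate the CCC marking rule on a binary tree with U leaves.

    Assumptions (not from the paper): heap layout with root 1 and leaves
    U..2U-1; a per-step independent failure probability q.
    Returns (x, steps, work), where work counts processor-steps.
    """
    assert U >= 1 and U & (U - 1) == 0, "U is assumed to be a power of two"
    rng = rng or random.Random()
    n_nodes = 2 * U - 1
    marked = [False] * (n_nodes + 1)  # index 0 is unused
    x = [0] * U
    alive = U                          # start with P = U processors
    steps = work = 0
    while not marked[1]:
        if alive == 0:
            raise RuntimeError("all processors failed before the root was marked")
        steps += 1
        work += alive
        to_mark = []
        for _ in range(alive):
            v = rng.randint(1, n_nodes)            # pick a tree node at random
            if v >= U:                             # v is a leaf
                to_mark.append(v)
            elif marked[2 * v] and marked[2 * v + 1]:
                to_mark.append(v)                  # both children already marked
        for v in to_mark:                          # apply all writes of this step
            marked[v] = True
            if v >= U:
                x[v - U] = 1                       # marking a leaf writes x[i] := 1
        # Each processor independently survives the step with probability 1 - q.
        alive = sum(1 for _ in range(alive) if rng.random() >= q)
    return x, steps, work


if __name__ == "__main__":
    x, steps, work = ccc_write_all(64, q=0.0, rng=random.Random(1))
    print("all written:", all(x), "steps:", steps, "work:", work)
```

Because an internal node is marked only after both of its children are marked, a marked root implies that every entry of x has been set, which is the certification property noted in the text.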

7.2      The  ACC-algorithm 

Once again, as in Martel et al. [MPS89], we start with U/log U processors and divide the array
x[1..U] to be marked into U/log U subarrays, each of size log U. Each log U-sized subarray is
now treated as a leaf of a full binary tree of 2U/log U - 1 nodes. The ACC-algorithm simply
involves running the CCC-algorithm on the new tree, where now marking a leaf of the tree
implies setting x[i] := 1 for all positions of the corresponding subarray.
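
The blocking idea can be sketched in the same style. The Python code below is only an illustration under stated assumptions: it uses blocks of size floor(log2 U) (rounding the number of blocks up when U is not divisible), charges one unit of work per array cell written and one unit per internal-node visit, tracks work only (not parallel time), and lets processors fail independently with probability q between rounds. The name acc_write_all and these accounting choices are hypothetical, not taken from [MPS89] or from the analysis here.

```python
import math
import random


def acc_write_all(U: int, q: float = 0.0, rng: random.Random = None):
    """Simulate the ACC idea: run the CCC marking rule on a tree of blocks.

    Assumptions (illustrative only): block size b = floor(log2 U), L = ceil(U/b)
    leaves in a heap-layout tree, P = L processors, work-only accounting.
    Returns (x, work).
    """
    assert U >= 1
    rng = rng or random.Random()
    b = max(1, int(math.log2(U)))   # block size, about log U
    L = math.ceil(U / b)            # number of leaves (blocks)
    n_nodes = 2 * L - 1
    marked = [False] * (n_nodes + 1)  # index 0 is unused; root is node 1
    x = [0] * U
    alive = L                       # about U / log U processors
    work = 0
    while not marked[1]:
        if alive == 0:
            raise RuntimeError("all processors failed before the root was marked")
        to_mark = []
        for _ in range(alive):
            v = rng.randint(1, n_nodes)
            if v >= L:              # leaf: the whole block gets written
                to_mark.append(v)
                work += b
            else:                   # internal node: one check of the children
                work += 1
                if marked[2 * v] and marked[2 * v + 1]:
                    to_mark.append(v)
        for v in to_mark:
            marked[v] = True
            if v >= L:
                start = (v - L) * b
                for i in range(start, min(start + b, U)):
                    x[i] = 1        # x[i] := 1 for every position in the block
        alive = sum(1 for _ in range(alive) if rng.random() >= q)
    return x, work


if __name__ == "__main__":
    x, work = acc_write_all(1024, rng=random.Random(0))
    print("all written:", all(x), "work:", work)
```

Running this for several values of U and comparing the measured work against U log log U and U log U gives an informal feel for the bound, though a simulation of this kind is no substitute for the analysis.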

Surprisingly, this simple modification to the CCC-algorithm does only O(U log log U)
expected work with as many as U/log U processors. Martel et al. also attempt an analysis of
this algorithm in [MPS89], but their analysis appears to be flawed.

Theorem 4 The expected work done by the ACC-algorithm to solve the Write-All problem with
P ≤ U/log U processors is O(U log log U).

8  A  lower  bound  on  the  work  of  any  deterministic  algorithm 
for  the  Write- All  problem 

Theorem 5 Given any deterministic P-processor CRCW PRAM algorithm that solves the
Write-All problem, an adversary can force Byzantine fail-stop errors that result in
Ω(U + P log U) work being done by the algorithm.